Chromosome-level genome assembly of the diploid oat species Avena longiglumis

Diploid wild oat Avena longiglumis has nutritional and adaptive traits that are valuable for common oat (A. sativa) breeding. By combining Illumina, Nanopore and Hi-C data, we assembled a high-quality chromosome-level genome of A. longiglumis (ALO), with a contig N50 of 12.68 Mb and 99% BUSCO completeness for an assembly size of 3,960.97 Mb. A total of 40,845 protein-coding genes were annotated. Repetitive DNA sequences made up 87.04% of the assembled genome. Dotplots comparing our genome assembly (PI657387) with two published ALO genomes indicated conservation of gene order and equal expansion of all syntenic blocks among the three genome assemblies. Two recent whole-genome duplication events were characterized in genomes of diploid Avena species. These findings provide new knowledge of the genomic features of A. longiglumis, give information about species diversity, and will accelerate functional genomics and breeding studies in oat and related cereal crops.


Background & Summary
Common oat (Avena sativa L.) and its wild relatives (2x, 4x, and 6x) are members of the Aveneae tribe (Poaceae). Clinical studies have shown that consuming oats can reduce serum cholesterol and cardiovascular disease, a benefit attributed to the soluble β-glucan component 1. Oats also exhibit a favourable glycaemic index, with a low value and slow carbohydrate breakdown. Plant oils derived from cereal seeds are vital agricultural commodities used for food, feed, and fuel. Oat endosperm has an oil content of 6-18%, which is significantly higher than that of other cereals [averaging 2.41% in barley (Hordeum vulgare) and 2.18% in wheat (Triticum aestivum)] 2,3. The high oil content of oat grain suggests a possible important use for food oils and in animal feeds 4. Despite this unique composition, global oat production has steadily declined over the past 50 years to 25 million tons in 2023 (http://www.fao.org/faostat/), suggesting that genetic improvement has lagged behind that of major cereal crops such as rice, wheat, and maize, making the crop less desirable to grow. There are therefore likely to be substantial opportunities for improvement of oat varieties.
Not least due to the large genome size of A. sativa (10.3 Gb) 5, oat genomic research lags behind that of other crops such as rice (Oryza sativa) 6, sorghum (Sorghum bicolor) 7 or foxtail millet (Setaria italica) 8. There is an urgent need for the characterization, exploitation and utilization of wild oat germplasm resources for oat and related crop breeding 9,10. A diploid genome of A. longiglumis Durieu (Fig. 1) reveals novelty in target genes and regulatory sequences, such as those for β-glucan synthesis, high linoleic content in grains, drought-adapted phenotypes, and resistance to crown rust disease 11. The rapidly developing field of structural variation requires multiple high-quality chromosome-scale assemblies to show the nature of intraspecific variation (individual, variety or populations), polymorphisms within and between diploid species and their related species, and generation of recent structural variations in polyploid species derived from diploid ancestors.
This study combined Illumina sequencing, Oxford Nanopore Technology (ONT) sequencing, and chromosome conformation capture (Hi-C) data to generate a high-quality chromosome-scale genome assembly of diploid A. longiglumis (ALO; Fig. S1). The genome assembly had a length of approximately 3,960.97 Mb (Table 1 and S1), which is slightly smaller than the genome size estimated by k-mer analysis (Fig. S2). After scaffolding the contigs into seven super-scaffolds, 98.84% of the reads were anchored. As observed in the Hi-C heatmap, the seven super-scaffolds were mapped to the corresponding seven pseudo-chromosomes (Fig. 2). Among the A. longiglumis genome sequences, 87.04% were classified as known repetitive DNA elements (Table 2), showing increased density in broad centromeric regions (Fig. 3 circle b). Compared to the published assemblies of the tetraploid A. insularis and hexaploid oat genomes 5,9, the diploid A. longiglumis genome in this study exhibits superior sequence continuity, as evidenced by its higher contig N50 (12.68 Mb) and scaffold N50 (527.34 Mb) values (Table 3). This indicates a high assembly quality of the diploid genome and supports the reliability of subsequent research.
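The contig and scaffold N50 values quoted above follow the usual definition: the length of the sequence at which the cumulative length of the sequences, sorted from longest to shortest, first reaches half of the total assembly length. The following minimal sketch illustrates that definition on invented toy lengths; the function name `n50` and the input values are illustrative assumptions and are not part of the assembly pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Toy contig lengths (arbitrary units), not data from this study.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70
```

With the toy input, the total is 300, and the running sum reaches 150 at the second-longest contig, so the N50 is 70.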
The BUSCO 12 results showed that 99.0% of the BUSCO genes were recovered as complete (82.7% single-copy and 16.3% duplicated), indicating high completeness of our A. longiglumis_PI657387 genome assembly (Table S2). Compared to the other diploid assemblies, A. longiglumis_CN58138 (93.0%) and A. eriantha (94.0%) (Extended Data Fig. 2a of ref. 5), our diploid A. longiglumis_PI657387 assembly exhibited a higher proportion of complete orthologous genes (99.0%; Fig. 4). Compared to tetraploid A. insularis (7.9%) and hexaploid A. sativa (11.2%), the A. longiglumis genome in our study also exhibited a higher proportion of single-copy orthologous genes (82.7%; Fig. 4). In addition, the proportion of fragmented genes in this diploid genome (0.2%) was similar to that found in A. sativa. A total of 40,845 protein-coding genes were annotated for A. longiglumis using the databases NCBI NR (Non-redundant protein) 13, EggNOG (Evolutionary genealogy of genes: non-supervised orthologous groups) 14, Pfam (Pfam protein families) 15, COG (Clusters of orthologous groups) 16, SwissProt (Swiss Institute of Bioinformatics and Protein Information Resource) 17, GO (Gene ontology) 18, KOG (EuKaryotic orthologous groups) 19, KEGG (Kyoto encyclopedia of genes and genomes) 20, PlantTFDB (Plant transcription factor database) 21, and CAZy (Carbohydrate-Active enZYmes) 22 (Table S3). Dotplots comparing our A. longiglumis assembly with two published genomes of A. longiglumis 5,9 indicated the conservation of gene order and equal expansion of all syntenic blocks among the three ALO genome assemblies (Fig. 5a,b).
Illumina sequencing and genome survey. Paired-end libraries with a 350 bp insert size were prepared using the Illumina TruSeq® Nano DNA library preparation kit (Illumina, San Diego, CA, USA) and sequenced on an Illumina NovaSeq 6000 platform (Table S1). Fastp v.0.23.2 23 was used to remove contaminants, Illumina adapters, and low-quality reads. The 268.60 Gb of clean data were processed via Kmerfreq_AR v.2.0.4 24. The 17-bp k-mers in the Illumina reads were counted using Jellyfish v.2.2.6 25 with default parameters. The genome size (3.966 Gb), heterozygosity (0.48%), and repeat content were estimated using GenomeScope v.2.0 26 (Fig. S2).
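The idea behind a k-mer-based genome survey can be illustrated with the simplest peak-based estimate: genome size is approximately the total number of k-mer occurrences divided by the depth of the main (homozygous) peak of the k-mer frequency histogram. The sketch below is a simplified illustration under stated assumptions; GenomeScope fits a statistical model to the histogram rather than using this shortcut, and the function name, the `min_depth` cutoff for discarding error k-mers, and the toy histogram are all hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Peak-based genome size estimate from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (an assumed cutoff for this sketch).
    Returns (estimated_size, peak_depth).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no histogram entries at or above min_depth")
    # Depth with the most distinct k-mers is taken as the main peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram, not data from this study.
toy_histogram = {1: 1000, 2: 300, 9: 50, 10: 100, 11: 50, 20: 10}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)  # prints 220.0 10
```

In the toy case, the low-depth error k-mers are dropped, the peak sits at depth 10, and the 2,200 retained k-mer occurrences give an estimated size of 220 positions.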
ONT sequencing and genome assembly. Genomic DNA (10 μg) was sheared into fragments of approximately 10-50 kb using a g-TUBE device (Covaris, Inc., MA, USA), followed by size selection with BluePippin (Sage Science, Inc., MA, USA). ONT PromethION (Genome Centre of Grandomics, Wuhan, China) sequencing libraries were prepared using the NEBNext End Repair/dA-Tailing Module (New England Biolabs, MA, USA) for DNA end repair and the ligation sequencing kit (SQK-LSK109, ONT, UK) (Table S1).
Hi-C sequencing and chromosome-level genome assembly. For Hi-C sequencing, 3-week-old leaves of A. longiglumis seedlings were fixed in 2% formaldehyde solution to obtain nuclear/chromatin samples. The DpnII enzyme (Cat. E0543L, NEB, UK) was used to digest the fixed tissues. Hi-C libraries were then constructed and sequenced on the Illumina NovaSeq 6000 platform to generate 150 bp paired-end reads (Table S1). High-quality reads were extracted and aligned to the reference genome assembly using Bowtie2 v.2.3.2 30. Juicer v.2.0 31 was used to create a de-duplicated list of alignments of Hi-C reads to the draft A. longiglumis assembly. HiC-Pro v.2.7.8 32 was used to determine the ligation site for each unmapped read, after which the 5' fragments were aligned to the genome assembly.
A single alignment file was generated by merging the results of both mapping steps, and low-quality reads, including reads with multiple matches, singletons, and mitochondrial DNA, were discarded. Valid interaction pairs were used to scaffold the assembled contigs into 7 pseudo-chromosomes with the LACHESIS pipeline 33. The quality and completeness of the genome assembly were evaluated using BUSCO v.5.4.6 12 (Table S2). In addition, the chromosome interaction matrix was plotted as a heatmap, which showed strong linkage in patches along the diagonal.
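A contact heatmap of this kind is built by counting valid Hi-C pairs per pair of fixed-size genomic bins (100 kb bins in Fig. 2). The following minimal sketch shows only the binning step; the tuple layout of the input pairs, the function name `bin_contacts`, and the toy coordinates are assumptions made for illustration and do not reproduce the Juicer or HiC-Pro file formats.

```python
from collections import Counter

BIN_SIZE = 100_000  # 100 kb bins, as in the heatmap of Fig. 2


def bin_contacts(pairs, bin_size=BIN_SIZE):
    """Count Hi-C contacts per unordered pair of genomic bins.

    pairs is an iterable of (chrom_a, pos_a, chrom_b, pos_b) tuples
    (a hypothetical layout for this sketch). Each read end is assigned
    to the bin containing it, and the two bins are sorted so that
    (A, B) and (B, A) are counted together.
    """
    counts = Counter()
    for chrom_a, pos_a, chrom_b, pos_b in pairs:
        bin_a = (chrom_a, pos_a // bin_size)
        bin_b = (chrom_b, pos_b // bin_size)
        counts[tuple(sorted([bin_a, bin_b]))] += 1
    return counts


# Toy valid pairs, not data from this study.
toy_pairs = [
    ("chr1", 50_000, "chr1", 120_000),
    ("chr1", 130_000, "chr1", 10_000),
    ("chr1", 99_999, "chr1", 5),
    ("chr2", 250_000, "chr1", 150_000),
]
contacts = bin_contacts(toy_pairs)
print(contacts[(("chr1", 0), ("chr1", 1))])  # prints 2
```

Counts concentrated in bins close to each other on the same chromosome produce the strong diagonal signal that is expected for a correctly ordered assembly.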
Gene prediction and functional annotation. Gene structures were predicted using three approaches: ab initio prediction, homology-based prediction, and RNA-seq-assisted prediction 43. The ab initio gene prediction was carried out on the A. longiglumis assembly using Augustus v.3.4.0 44 with default parameters. The homology-based prediction was performed with GeMoMa v.1.6.1 45 with default parameters, using filtered proteins from the genomes of six species (Arabidopsis thaliana 46, Brachypodium distachyon 47, Hordeum vulgare 48, Sorghum bicolor 7, Triticum aestivum 49 and Zea mays 50). The RNA-seq-based gene prediction was executed using TransDecoder v.5.5.0 51. High-confidence (HC) genes comprise homology-based predictions supported by ≥ two species (1,083) and RNA-seq-assisted predictions with an FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) value > 0 (32,188). The gene structures predicted by these three approaches were integrated into consensus gene models using EVidenceModeler v.1.1.1 52. The resulting gene models were then filtered to obtain a precise gene set, with genes containing transposable element sequences removed using TransposonPSI v.1.0.0 (http://transposonpsi.sourceforge.net/).
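The high-confidence criterion can be made concrete with a short sketch. FPKM is computed from its standard definition (fragments per kilobase of exon model per million mapped fragments), and a gene is flagged as HC when it has homology support from at least two species or an FPKM above zero. Reading the two criteria as alternatives is our interpretation of the description above; the function names and toy numbers are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon model per Million mapped fragments."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def is_high_confidence(n_supporting_species, fpkm_value):
    """Assumed HC rule: homology support from >= 2 species, or FPKM > 0."""
    return n_supporting_species >= 2 or fpkm_value > 0


# Toy gene: 200 fragments on a 2,000 bp exon model, 20 million mapped fragments.
expression = fpkm(200, 2_000, 20_000_000)
print(expression)                         # prints 5.0
print(is_high_confidence(0, expression))  # prints True
print(is_high_confidence(1, 0.0))         # prints False
```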
Non-coding RNA annotation. Non-coding RNA (ncRNA) genes were predicted across the genome. Initially, the sequences were aligned against the Rfam library v.11.0 55 to annotate genes encoding various non-coding RNAs, including small nuclear RNAs (snRNAs), ribosomal RNAs (rRNAs), and microRNAs (miRNAs). The transfer RNA (tRNA) sequences were subsequently identified using tRNAscan-SE v.2.0 56 (Table 1).

Pairwise comparisons of genome assemblies.
To create the dotplots of A. longiglumis, the reference sequences of CN58138 5 and CN58139 9 were each aligned with the de novo assembly of PI657387 using Minigraph v.2.25 57 with the '-ax asm5' option, resulting in a PAF alignment file. The PAF file was uploaded to D-Genies v.1.5.0 58 to create the dotplot using the default settings. Dotplots of the assembly (accession PI657387) were compared with two published genomes of A. longiglumis, indicating the conservation of gene order and equal expansion of all syntenic blocks among the three genome assemblies (Fig. 5a,b).
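Before plotting, a PAF file can be summarised to check which chromosomes pair up between two assemblies. The sketch below relies only on the standard PAF column layout (column 1 query name, column 6 target name, column 11 alignment block length, column 12 mapping quality) and sums aligned block lengths per query-target pair. The function name, the `min_mapq` parameter, and the sequence names in the toy records are hypothetical and are not taken from the assemblies compared here.

```python
from collections import defaultdict


def aligned_bases_by_pair(paf_lines, min_mapq=0):
    """Sum alignment block lengths per (query, target) pair from PAF records."""
    totals = defaultdict(int)
    for line in paf_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        query, target = fields[0], fields[5]
        block_length, mapq = int(fields[10]), int(fields[11])
        if mapq >= min_mapq:
            totals[(query, target)] += block_length
    return dict(totals)


# Toy PAF records with invented sequence names and coordinates.
toy_paf = [
    "refChr1\t1000\t0\t400\t+\tnewChr1\t1200\t10\t410\t390\t400\t60",
    "refChr1\t1000\t500\t900\t+\tnewChr1\t1200\t600\t1000\t380\t400\t60",
    "refChr1\t1000\t400\t500\t-\tnewChr2\t900\t0\t100\t80\t100\t5",
]
print(aligned_bases_by_pair(toy_paf, min_mapq=20))
# prints {('refChr1', 'newChr1'): 800}
```

A summary dominated by one target per query chromosome is what a dotplot with a single main diagonal per chromosome pair would be expected to show.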

Fig. 1 The spikelet of Avena longiglumis. Two glumes nearly equal in length (left), the first (middle) and the second (right) florets disarticulated with 2-3 mm awl-shaped callus at the floret base together with 8-12 mm bristles at the lemma tip. Scale bar, 1 cm.

Fig. 2 Genome-wide chromatin interaction heatmap (100 kb bins) of diploid A. longiglumis (ALO, PI657387) based on Hi-C data showing chromosome-scale continuity of the assembly. Small shaded circles denote the centromeric locations.

Fig. 5 Pairwise comparisons of dotplots for three Avena longiglumis (ALO) genome assemblies and the diploid Avena species genomes. (a) ALO_PI657387 and ALO_CN58138 (Kamal et al. 5). (b) ALO_PI657387 and ALO_CN58139 (Peng et al. 9). The dotplots provide insights into the conservation of gene order and the genomic rearrangements among three A. longiglumis genome assemblies. The x- and y-axes represent the genomic coordinates of each species.

Table 1. Genome assembly statistics and gene predictions in the Avena longiglumis genome.

Table 2. Repetitive DNA composition of the Avena longiglumis genome.

Table 3. Summary of genome assemblies of Avena longiglumis from this study and of published tetraploid A. insularis and hexaploid A. sativa. -: unavailable data.
National Botanical Garden, Guangzhou, China. Young leaves were collected for DNA isolation and whole-genome sequencing. The leaves and roots were collected for RNA-sequencing (RNA-seq) and transcriptome assembly. The samples were flash-frozen in liquid nitrogen immediately after harvest and stored at −80 °C for subsequent nucleic acid extraction. RNA was extracted and purified using the Qiagen RNeasy Plant Mini Kit (Qiagen, CA, USA) following the manufacturer's instructions, and two paired-end read datasets, one of 8 Gb and one of 10 Gb, were obtained. A total of 511.4 Gb of Oxford Nanopore Technology (ONT) long reads (~128.9× coverage), 435.6 Gb of Hi-C reads (~109.8× coverage), 268.6 Gb (~67.7× coverage) of paired-end Illumina reads, and 99.0 Gb of RNA-seq reads were generated for the genome assembly, genome survey, and transcriptome assembly (Table S1).
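The coverage values quoted for each data type are consistent with dividing the sequencing yield by the k-mer-based genome size estimate of 3.966 Gb. That choice of divisor is our assumption, made because it reproduces the reported figures; the short sketch below carries out the arithmetic with hypothetical variable and function names.

```python
GENOME_SIZE_GB = 3.966  # k-mer-based estimate; assumed to be the divisor used


def sequencing_depth(yield_gb, genome_size_gb=GENOME_SIZE_GB):
    """Average sequencing depth (fold coverage) = yield / genome size."""
    return yield_gb / genome_size_gb


# Yields in Gb as reported in the text.
yields_gb = {"ONT": 511.4, "Hi-C": 435.6, "Illumina": 268.6}
depths = {name: round(sequencing_depth(value), 1) for name, value in yields_gb.items()}
print(depths)  # prints {'ONT': 128.9, 'Hi-C': 109.8, 'Illumina': 67.7}
```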